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Abstract —We describe FactorBase , a new SQL-based frame¬ 
work that leverages a relational database management system to 
support multi-relational model discovery. A multi-relational sta¬ 
tistical model provides an integrated analysis of the heterogeneous 
and interdependent data resources in the database. We adopt the 
BayesStore design philosophy: statistical models are stored and 
managed as first-class citizens inside a database i30l . Whereas 
previous systems like BayesStore support multi-relational infer¬ 
ence, FactorBase supports multi-relational learning. A case 
study on six benchmark databases evaluates how our system 
supports a challenging machine learning application, namely 
learning a first-order Bayesian network model for an entire 
database. Model learning in this setting has to examine a large 
number of potential statistical associations across data tables. 
Our implementation shows how the SQL constructs in Factor- 
Base facilitate the fast, modular, and reliable development of 
highly scalable model learning systems. 

1. Introduction 

Data science brings together ideas from different fields for 
extracting value from large complex datasets. The system de¬ 
scribed in this paper combines advanced analytics from multi- 
relational or statistical-relational machine learning (SRL) with 
database systems. The power of combining machine learn¬ 
ing with database systems has been demonstrated in several 
systems idi, ca, 0 . The novel contribution of FACTOR- 
BASE is supporting machine learning for multi-relational 
data, rather than for traditional learning where the data are 
represented in a single table or data matrix. We discuss new 
challenges raised by multi-relational model learning compared 
to single-table learning, and how FactorBase solves them 
using the resources of SQL (Structured Query Language). The 
name FactorBase indicates that our system supports learn¬ 
ing factors that define a log-linear multi-relational model ifT^ . 
Supported new database services include constructing, storing, 
and transforming complex statistical objects, such as factor- 
tables, cross-table sufficient statistics, parameter estimates, and 
model selection scores. 

Multi-relational data have a complex structure, that in¬ 
tegrates heterogeneous information about different types of 
entities (customers, products, factories etc.) and different types 
of relationships among these entities. A statistical-relational 
model provides an integrated statistical analysis of the het¬ 
erogeneous and interdependent complex data resources main¬ 
tained by the database system. Statistical-relational models 
have achieved state-of-the-art performance in a number of 
application domains, such as natural language processing, 


ontology matching, information extraction, entity resolution, 
link-based clustering, query optimization, etc. El, GOl, 0. 
Database researchers have noted the usefulness of statistical- 
relational models for knowledge discovery and for representing 
uncertainty in databases 1 ^ . 1301 . They have developed a 
system architecture where statistical models are stored as first- 
class citizens inside a database. The goal is to seamlessly 
integrate query processing and statistical-relational inference. 
These systems focus on inference given a statistical-relational 
model, not on learning the model from the data stored in 
the RDBMS. FactorBase complements the in-database 
probabilistic inference systems by providing an in-database 
probabilistic model learning system. 

A. Evaluation 

We evaluate our approach on six benchmark databases. 
For each benchmark database, the system applies a state- 
of-the-art SRL algorithm to construct a statistical-relational 
model. Our experiments show that FactorBase pushes 
the scalability boundary: Learning scales to databases with 
over 10^ records, compared to less than 10^ for previous 
systems. At the same time it is able to discover more complex 
cross-table correlations than previous SRL systems. We report 
experiments that focus on two key services for an SRL client: 
(1) Computing and caching sufficient statistics, (2) computing 
model predictions on test instances. For the largest benchmark 
database, our system handles 15M sufficient statistics. SQL 
facilitates block-prediction for a set of test instances, which 
leads to a 10 to 100-fold speedup compared to a simple loop 
over test instances. 

B. Contributions 

FactorBase is the first system that leverages relational 
query processing for learning a multi-relational log-linear 
graphical model. Whereas the in-database design philosophy 
has been previously used for multi-relational inference, we are 
the first to adapt it for multi-relational model structure learning. 
Pushing the graphical model inside the database allows us to 
use SQL as a high-level scripting language for SRL, with the 
following advantages. 

1) Extensibility and modularity, which support rapid pro¬ 
totyping. SRL algorithm development can focus on 
statistical issues and rely on a RDBMS for data access 
and model management. 
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2) Increased scalability, in terms of both the size and the 
complexity of the statistical objects that can be handled. 

3) Generality and portability: standardized database opera¬ 
tions support “out-of-the-box” learning with a minimal 
need for user configuration. 

C. Paper Organization 

We provide an overview of the system components and 
flow. For each component, we describe how the component 
is constructed and managed inside an RDBMS using SQL 
scripts and the SQL view mechanism. We show how the system 
manages sufficient statistics and test instance predictions. The 
evaluation section demonstrates the scalability advantages of 
in-database processing. The intersection of machine learning 
and database management has become a densely researched 
area, so we end with an extensive discussion of related work. 

IT Background on Statistical-Relational 
Learning 

We review enough background from statistical-relational 
models and structure learning to motivate our system design. 
The extensive survey by Kimmig et al ifT^ provides further 
details. The survey shows that SRL models can be viewed as 
log-linear models based on par-factors, as follows. 

A. Log-linear Template Models for Relational Data 

Par-factor stands for “parametrized factor”. A par factor rep¬ 
resents an interaction among parametrized random variables, or 
par-RVs for short. We employ the following notation for par- 
RVs d 2.2 .5]. Constants are expressed in lower-case, e.g. 
joe, and are used to represent entities. A type is associated 
with each entity, e.g. joe is a person. A first-order variable is 
also typed, e.g. Person denotes some member of the class of 
persons. A functor maps a tuples of entities to a value. We 
assume that the range of possible values is finite. An atom 
is an expression of the form r(ri,... ,ra) where each is 
either a constant or a first-order variable. If all of ri,..., are 
constants, r(ri,..., r^) is a ground atom or random variable 
(RV), otherwise a first-order atom or a par-RV. A par-RV is 
instantiated to an RV by grounding, i.e. substituting a constant 
of the appropriate domain for each first-order variable. 

A par-factor is a pair ^ = (A, fi), where A is a set of par- 
RVs, and 0 is a function from the values of the par-RVs to 
the non-negative real numbers]^ Intuitively, a grounding of a 
par-factor represents a set of random variables that interact 
with each other locally. SRL models use parameter tying, 
meaning that if two groundings of the same par-factor are 
assigned the same values, they return the same factor value. 
A set of parfactors T defines a joint probability distribution 
over the ground par-RVs as follows. Let T{^i) denote the 
set of all ground par-RVs in par-factor ^i. Let x be a joint 
assignment of values to all ground random variables. Notice 
that this assignment determines the values of all ground atoms. 
An assignment X = x is therefore equivalent to a single 

^A par-factor can also include constraints on possible groundings. 


database instance. The probability of a database instance is 
given by the log-linear equation (HI Eq.7]: 

p(x=x)=|n n (1) 

where xa represents the values of those variables in A that are 
necessary to compute fii. Equationcan be read as follows. 

1) Instantiate all parfactors with all possible constants. 

2) For each ground par-factor, apply the values specified 
by the joint assignment X = x, and compute the 
corresponding factor term. 

3) Multiply all the factors together, and normalize the 
product. 

Typically the number of ground par-RVs will be very 

large. Equation can be evaluated, without enumerating the 
ground par-factors, as follows. 

1) For each par-factor, for each possible assignment of 
values, find the number of ground factors with that 
assignment of values. 

2) Raise the factor value for that assignment to the number 
of instantiating factors. 

3) Multiply the exponentiated factor values together, and 
normalize the product. 

The number 2) of ground factors with the same assignment 
of values is known as a sufficient statistic of the log-linear 
model. 



Eigure 1. A relational ER Design for a university domain. 
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Figure 2. Database Table Instances: (a) Student, (b) RA, (c) Professor. 


B. Examples 

SRL has developed a number of formalisms for describing 
par-factors {T4 \ . First-order probabilistic graphical models are 
popular both within SRL and the database community (TH . 
(301 . The model structure is defined by edges connecting 
par-RVs. For instance, a parametrized Bayesian network 
structure is a directed acyclic graph whose nodes are par-RVs. 
Figure shows a Bayesian network for a University domain. 
We use the university example as a toy running example 
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Figure 3. (a) Bayesian network for the University domain. We omit 

the Registered relationship for simplicity. The network was learned 
from the University dataset m. (b) Conditional Probability table 
Capability{¥, S)_CPT, for the node Capability{¥, S). Only value combi¬ 
nations that occur in the data are shown. This is an example of a factor table, 
(c) Contingency Table Capability(F, E>)_CT for the node Capability{F, S) 
and its parents. Both CP and CT tables are stored in an RDBMS. 


throughout the paper. The schema for the university domain is 
given in Figure This schema features only one relationship 
for simplicity; FactorBase learns a model for any number 
of relationships. While we describe FactorBase abstractly 
in terms of par-factors, for concreteness we illustrate it using 
Bayesian networks. The system takes as input a database 
instance like that shown in Figure and produces as output 
a graphical model like that shown in Figure 

A par-factor in a Bayesian network is associated with a 
family of nodes (HI Sec.2.2.1]. A family of nodes comprises 
a child node and all of its parents. For example, in the BN 
of Figure one of the par-factors is associated with the 
par-RV set A = { Capability{F, S), SalaryRA{F,E))}. 
For the database instance of Figure there are 3 x 3 = 9 
possible factors associated with this par-RV, corresponding 
to the Cartesian product of 3 professors and 3 students. 
The value of the factor 0 is a function from an assignment 
of family node values to a non-negative real number. In a 
Bayesian network, the factor value represents the conditional 
probability of the child node value given its parent node values. 
These conditional probabilities are typically stored in a table 
as shown in Figure |^b). This table represents therefore the 
function f associated with the family par-factor. Assuming 
that all par-RVs have finite domains, a factor can always be 
represented by a factor table of the form Figure |^b): there is a 
column for each par-RV in the factor, each row specifies a joint 
assignment of values to a par-RV, and the factor column gives 
the value of the factor for that assignment (cf. (HI Sec.2.2.1]). 

The sufficient statistics for the Capability {F family can 
be represented in a contingency table as shown in Figure [^c). 
For example, the first row of the contingency table indicates 


that the conjunction 

CapabilityiF^B) = n/a., Salary{F^B) = n/a, RA{F,B) = F 

is instantiated 203 times in the University database (publicly 
available at (HI)- This means that for 203 professor-student 
pairs, the professor did not employ the student as an RA (and 
therefore the salary and capability of this RA relationship is 
undefined or n/a). 

C. SRL Structure Learning 

Algorithm shows the generic format of a statistical- 
relational structure learning algorithm (adapted from (HI)- The 
instantiation of procedures in lines 2, 3, 5 and 8 determines the 
exact behavior of a specific learning algorithm. The structure 
algorithm carries out a local search in the hypothesis space of 
graphical relational models. A set of candidates is generated 
based on the current model (line 3), typically using a search 
heuristic. For each candidate model, parameter values are esti¬ 
mated that maximize a model selection score function chosen 
by the user (line 5). A model selection score is computed for 
each model given the parameter values, and the best-scoring 
candidate model is selected (line 7). We next discuss our 
system design and how it supports model discovery algorithms 
that follow the outline of Algorithm Figure outlines the 
system components and dependencies among them. 


Algorithm 1: Structure learning algorithm 

Input: Hypothesis space H (describing graphical 

models), training data V (assignments to random 
variables), scoring function score (•, P) 

Output: A graph structure G representing par-factors. 

1: G4-0 

2: while continue(G, U, score (•, V) ) do 
3: refineCandidates(G, V) 

4: for each i? G 7^ do 

5: RC- LEARNPARAMETERS(i?,SCOre (•, V)) 

6: end for 

7: G ^ argmaxG'/G 7 ^u{G} score(G', V) 

8 : end while 
9: return G 


III. The Random Variable Database 

Statistical-relational learning requires various metadata 
about the par-RVs in the model. These include the following. 
Domain the set of possible values of the par-RV. 

Types Pointers to the first-order variables in the par-RV. 

Data Link Pointers to the table and/or column in the input 
database associated with the par-RV. 

The metadata must be machine-readable. Following the in¬ 
database design philosophy, we store the metadata in tables so 
that an SRL algorithm can query it using SQL. The schema 
analyzer uses an SQL script that queries key constraints 
in the system catalog database and automatically converts 
them into metadata stored in the random variable database 
VDB. In contrast, existing SRL systems require users to 
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Figure 4. System Flow. All statistical objects are stored as first-class citizens in a DBMS. Objects on the left of an arrow are utilized for constructing objects 
on the right. Statistical objects are constructed and managed by different modules, shown as boxes. 


TABLE I. Translation from ER Diagram to Par-RVs 


ER Diagram 

Example 

par-RV equivalent 

Entity Set 

Student, Course 

S,C 

Relationship Set 

RA 

RA(P,§) 

Entity Attributes 

intelligence, ranking 

Intelligence(S), Ranking(§) 

Relationship Attributes 

capability, salary 

Capability(P, §), Salary(P, §) 


specify information about par-RVs and associated types. Thus 
FactorBase utilizes the data description resources of SQL 
to facilitate the “setup task” for relational learning 1291 . Due 
to space constraints, we do not go into the details of the 
schema analyzer. Instead, we illustrate the general principles 
with the ER diagram of the University domain (Figure [^. The 
Appendix provides a full description with MySQL script. 

The translation of an ER diagram into a set of functors 
converts each element of the diagram into a functor, except for 
entity sets and key fields ifTol . Table illustrates this transla¬ 
tion. In terms of database tables, attribute par-RVs correspond 
to columns. Relationship par-RVs correspond to tables, not 
columns. Including a relationship par-RV in a statistical model 
allows the model to represent uncertainty about whether or not 
a relationship exists between two entities ca. The values of 
descriptive attributes of relationships are undefined for entities 
that are not related. We represent this by introducing a new 
constant n/a m the domain of a relationship attribute (TEh : 
see Figure (right). Table [II] shows the schema for some of 
the tables that store metadata for each relationship par-RV, as 
follows. par-RV and FO-Var are custom types. 

Relationship The associated input data table. 
Relationship_Attributes Descriptive attributes associated 
with the relationship and with the entities involved. 
Relationship_FOVariables The first-order variables con¬ 
tained in each relationship par-RV0 


TABLE II. Selected Tables In the Variable Database Schema. 


Table Name 

Column Names 

Relationship 

RVarlD: par-RV, TABLE NAME: string 

Relationship Attributes 

RVarlD: par-RV, AVarlD: par-RV, EO-ID: EO-Var 

Relationship FOvariables 

RVarlD: par-RV, EO-ID: EO-Var, TABLE NAME: string 


While we have described constructing the variable database 
for an ER model, different structured data models can be 

^The schema assumes that all relationships are binary. 


1 select from AttribiiteColmnns ; 

2 select * from "'Domain^; 


/l^AttributeColumns fZxlOl\ /H Domain (2x33)\ 
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Figure 5. The metadata about attributes represented in VDB database 
tables. Left: The table AttrihuteColumns specifies which tables and columns 
contain the functor values observed in the data. The column name is also the 
functor ID. Right: The table Domain lists the domain for each functor. 

represented by an appropriate first-order logic vocabulary flAl , 
that is, an appropriate choice of functors. For example, in a star 
schema, facts can be represented in the form /(Bi,..., D/.), 
where the first-order variable B^ ranges over the primary key of 
dimension table i. Attributes of dimension i can be represented 
by a unary functor a(B^). FactorBase can perform structure 
learning for different data models after the corresponding data 
format has been translated into the VDB format. 

IV. The Count Manager 

The count database CDB stores a set of contingency 
tables. Contingency tables represent sufficient statistics as 
follows |[T^ . Consider a fixed list of par-RVs. A query is a 
set of {variable = value) pairs where each value is of a valid 
type for the variable. The result set of a query in a database 
V is the set of instantiations of the logical variables such that 
the query evaluates as true in V. For example, in the database 
of Figure the result set for the query 
RA{¥,S) = T, Capability{¥,S>) = 3, Salary{¥,S>) = high 
is the singleton {{jack, Oliver)}. The count of a query is the 
cardinality of its result set. 

Every set of par-RVs V = {Vi,... ,Vn} has an associated 
contingency table (CT) denoted by CT(V). This is a table 
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with a row for each of the possible assignments of values to 
the variables in V, and a special integer column called count. 
The value of the count column in a row corresponding to 
Vi = vi,... ,Vn = Vn records the count of the corresponding 
query. Figure [^b) shows a contingency table for the par-RVs 
RA{¥,S>), CapabilitySalary The contingency 
table problem is to compute a contingency table for par-RVs 
V and an input database V. 

SQL Implementation With Metaqueries. We describe how 
the contingency table problem can be solved using SQL. This 
is relatively easy for a fixed set of par-RVs; the challenge is a 
general construction that works for different sets of par-RVs. 
For a fixed set, a contingency table can be computed by an 
SQL count(*) query of the form 

CREATE VIEW CT-table(<VARIABLE-LIST>) AS 

SELECT COUNT(^) AS count, <VARIABLE-LIST> 

EROM TABLE-LIST 

GROUP BY VARIABLE-LIST 

WHERE <Join-Conditions> 

FactorBase uses SQL itself to construct the count- 
conjunction query. We refer to this construction as an SQL 
metaquery. We represent a count(*) query in four kinds of 
tables: the Select, From, Where and Group By tables. Each 
of these tables lists the entries in the corresponding count(*) 
query part. Given the four metaquery tables, the corresponding 
SQL count(*) query can be easily constructed and executed in 
an application to construct the contingency table. Given a list 
of par-RVs as input, the metaquery tables are constructed as 
follows from the metadata in the database VDB. 

FROM LIST Find the tables referenced by the par-RV’s. A 
par-RV references the entity tables associated with its 
first-order variables (see VDB.Relationship_FOvariables). 
Relational par-RV’s also reference the associated relation¬ 
ship table (see VDB.Relationship). 

WHERE LIST Add join conditions on the matching primary 
keys of the referenced tables in the WHERE clause. The 
primary key columns are recorded in VDB. 

SELECT LIST For each attribute par-RV, find the corre¬ 
sponding column name in the original database (see 
VDB.AttributeColumns). Rename the column with the ID 
of the par-RV. Add a count column. 

GROUP BY LIST The entries of the Group By table are the 
same as in the Select table without the count column. 

Figure shows an example of a metaquery for the university 
database. This metaquery defines a view that in turn defines 
a contingency table for the random variable list associated 
with the relationship table RA. This list includes the en¬ 
tity attributes of professors and of students, as well as the 
relationship attributes of the RA relationship. The Bayesian 
network of Figure was learned from this contingency table. 
The contingency table defined by the metaquery of Figure 
contains only rows where the value of RA is true. The Mobius 
Virtual Join flSl can be used to extend this contingency table 
to include counts for when RA is false, like the table shown 
in Figure [^c). 


Metaqueries 

Entries 

CREATE VIEW Seleet List AS 

SELECT RVarlD, CONCAT(‘COUNT(*)',' as "count"') AS Entries 

FROM VDB.Relationship 

UNION DISTINCT 

SELECT RVarlD, AVarlD AS Entries 

FROM VDB.RelationshipAttributes; 

COUNT(*) as “count" 

'Popularity(P)' 

' Teachingability (P)' 

'lntelUgence(S)' 

'Ranking(S)' 

CREATE VIEW From_List AS 

SELECT RVarlD, CONCAT('@database@.',TABLE_NAME) AS Entries 
FROM VDB.Relationship_FOvariables 

UNION DISTINCT 

SELECT RVarlD, CONCAT('@database@.',TABLE_NAME) AS Entries 
FROM VDB.Relationship; 

@database@.prof AS P 

@database@.student AS S 

@database@.RA AS 'RA' 

CREATE VIEW Where_List AS 

SELECT 

RVarlD, CONCAT(RVarID,'.',COLUMN_NAME,' =', 

FO-ID,'.', REFERENCED_COLUMN_NAME) AS Entries 

FROM VDB.Relationship_FOvariables; 

'RA'.p_id = P.p_id 

'RA'.s_id =S.s_id 


Figure 6. Example of metaquery results based on university database and 
the par-RV metadata (Table [II). 


V. The Model Manager 

The Model Manager provides two key services for 
statistical-relational structure learning: 1) Estimating and stor¬ 
ing parameter values (line 5 of Algorithm[^. 2) Computing one 
or more model selection scores (line 7 of Algorithm [T]). Fac- 
TORBase uses a store+score design for these services, which 
is illustrated in Figure A model structure table represents a 
candidate model. When a candidate model structure is inserted, 
a view uses the sufficient statistics from a contingency table 
to compute a table of parameter values. Another view uses the 
parameter values and sufficient statistics together to compute 
the score for the candidate model. 



A. The MDB Schema 

The relational schema for the Models Database is shown 


in Table III The @par-RVID@ parameter refers to the ID of 
a par-RV, for instance Capability(F,S). The model manager 
stores a set of factor tables (cf. Section II-B| ). In a graphical 
model, each factor is defined by the local topology of the 
model template graph. For concreteness, we illustrate how 
factor tables can be represented for Bayesian networks. The 
graph structure can be stored straightforwardly in a database 
table BaycsNet whose columns are child and parent. The 
table entries are the IDs of par-RVs. For each node, the MDB 
manages a conditional probability table. This is a factor table 
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TABLE III. The main tables in the Models Database MDB. 
For a Bayesian network, the MDB stores its structure, 

PARAMETER ESTIMATES, AND MODEL SELECTION SCORES. 


BayesNet(child:par-RV,parent:par-RV) 

@par-RVID@_CPT(@par-RVID@:par-RV,parent^ :par-RV,. . . , parent^ :par-RV,cp:real) 
Scores(child:par-RV,loglikelihood:real,#par:int,aic:real) 


that represents the factor associated with the node’s family (see 
Figure [^b)). In a Bayesian network, model selection scores are 
decomposable. This means that there is a local score associated 
with each family, such that the total score for the BN model is 
the sum of the local scores. For each family, the local score is 
stored in the Scores table indexed by the family’s child node. 


B. Parameter Manager 

Deriving predictions from a model requires estimating val¬ 
ues for its parameters. Maximizing the data likelihood is the 
basic parameter estimation method for Bayesian networks. The 
maximum likelihood estimates equal the observed frequency 
of a child value given its parent values. 

SQL Implementation With Natural Join. Given the sufficient 
statistics in a contingency table, a conditional probability table 
containing the maximum likelihood estimates can be computed 
by aggregation using SQL as in the example below. 

SELECT count/temp.parent_count as CP, 
capability(P,S), RA(P,S), salary(P,S) 

EROM capability(P,S)_CT 
NATURAL JOIN 

(SELECT sum(Count) as parent_count, 

RA(P,S), salary(P,S) 

EROM capability(P,S)_CT 

GROUP BY RA(P,S), salary (P,S) ) as temp 


C. Model Score Computation 

A typical model selection approach is to maximize the 
likelihood of the data, balanced by a penalty term. For instance, 
the Akaike Information Criterion (AIC) is defined as follows 

AIC{G,V) = ln{PQ{V)) - #par{G) 

where G is the BN G with its parameters instantiated to be 
the maximum likelihood estimates given the database V, and 
^par{G) is the number of free parameters in the structure G. 
The number of free parameters for a node is the product of 
(the possible values for the parent nodes) x (the number of 
the possible values for the child node -1). Given the likelihood 
and the number of parameters, the AIC column is computed 
as A/G = loglikelihood — ^par. Model selection scores other 
than AIC can be computed in a similar way given the model 
likelihood and number of parameters. 

1) Parameter Number Computation: To determine the num¬ 
ber of parameters of the child node @parVar-ID@, the number 
of possible child and parent values can be found from the 
VDB.Domain table in the Random Variable Database. 


2 ) Likelihood Computation: As explained in Section II-A 


the log-likelihood can be computed by multiplying the instan¬ 
tiation counts of a factor by its value. Assuming that instanti¬ 
ation counts are represented in a contingency table and factor 
values in a factor table, this multiplication can be elegantly per¬ 
formed using the Natural Join operator. For instance, the log- 
likelihood score associated with the Capahility{¥^ S) family 
is given by the SQL query below. The aggregate computation 
in this short query illustrates how well SQL constructs support 
complex computations with structured objects. 


SELECT Capability(P,S), SUM 
(MDB.Capability(P,S)_CPT.cp ^ 

CDB.Capability(P,S)_CT.count) 

AS loglikelihood 

EROM MDB.Capability(P,S)_CPT 

NATURAL JOIN CDB.Capability (P,S)_CT 

It is possible to extend the model manager to handle the 
multi-net structure learning method of the leam-and-join algo¬ 
rithm 1^ . The algorithm learns multiple Bayesian networks 
and propagates edges among them. The MDB schema is easily 
extended to store multiple Bayesian networks in a single table. 
The edge propagation can be executed by the RDBMS using 
the view mechanism. For more details, please see 1241 . 

This completes our description of how the modules of 
FactorBase are implemented using SQL. We next show 
how these modules support a key learning task: computing the 
predictions of an SRL model on a test instance. 


VI. Test Set Predictions 

Computing probabilities over the label of a test instance is 
important for several tasks. 1) Classifying the test instance, 
which is one of the main applications of a machine learning 
system for end users. 2 ) Comparing the class labels predicted 
against true class labels is a key step in several approaches to 
model scoring flAj . 3) Evaluating the accuracy of a machine 
learning algorithm by the train-and-test paradigm, where the 
system is provided a training set for learning and then we 
test its predictions on unseen test cases. We first discuss how 
to compute a prediction for a single test case, then how to 
compute an overall prediction score for a set of test cases. 
Class probabilities can be derived from Equation as follows 
IH Sec. 2 . 2 . 2 ]. Let Y denote a ground par-RV to be classified, 
which we refer to as the target variable. For example, a 
ground atom may be Intelligence {jack). In this example, we 
refer to jack as the target entity. Write X_y for a database 
instance that specifies the values of all ground par-RVs, except 
for the target, which are used to predict the target node. Let 
[X_y,^] denote the completed database instance where the 
target node is assigned value y. The log-linear model uses the 
likelihood P([X_y, ^]) as the joint score of the label and the 
predictive features. The conditional probability is proportional 
to this score: 


P(^|X_Y)(xP([X_y,^]) (2) 

where the joint distribution on the right-hand side is defined 
by Equation T] and the scores of the possible class labels need 
to be normalized to define conditional probabilities. 
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SQL Implementation. The obvious approach to computing 
the log-linear score would be to use the likelihood computation 
of Section IV-CI for the entire database. This is inefficient 
because only instance counts that involve the target entity 
change the classification probability. This means that we need 
only consider query instantiations that match the appropriate 
logical variable with the target entity (e.g., § = jack). 

For a given set of random variables, target entity instantia¬ 
tion counts can be represented in a contingency table that we 
call the target contingency table. Figure shows the format 
of a contingency table for target entities jack resp. Jill. 


jack_Capability_(P,S)_CT 


sid 

Count 

Cap.(P,S) 

RA(P,S) Salary(P,S) 

jack 

5 

N/A 

N/A F 

jack 

5 

4 

high T 



jill_Capability_(P,S)_CT 


sid 

Count 

Cap.(P,S) 

RA(P,S) Salary(P,S) 

fill 

3 

N/A 

N/A F 

fill 

7 

4 

high T 



Figure 8. Target contingency tables for target = jack and for target = Jill. 

Assuming that for each node with ID @parRVID@, a tar¬ 
get contingency table named CDB.target_^parRVID^_CT 
has been built in the Count Database CDB, the log- 
likelihood SQL is as in Section |V-C| For instance, the 
contribution of the Capability{F,E>) family is computed 
by the SQL query shown, but with the contingency table 
jack_Capability(P,S)_CT in place of Capability(P,S)_CT. The 
new problem is finding the target contingency table. SQL 
allows us to solve this easily by restricting counts to the target 
entity in the WHERE clause. To illustrate, suppose we want to 
modify the contingency table query of Figure to compute the 
contingency table for S = jack. We add the student id to the 
SELECT clause, and the join condition S.s_id = jack to the 
WHERE clause; see Table |IV|The EROM clause is the same 
as in Eigure The metaquery of Eigure is easily changed 
to produce these SELECT and WHERE clauses. 

Next consider a setting where a model is to be 
scored against an entire test set. Eor concreteness, 
suppose the problem is to predict the intelligence of 
a set of students Intelligence {jack), Intelligence{jill), 
Intelligence{student 3) ^..., Intelligenee{student m) • SQL 
supports block access where we process the test instances as 
a block. Intuitively, instead of building a contingency table 
for each test instance, we build a single contingency table that 
stacks together the individual contingency tables (Eigure [^. 
Blocked access can be implemented in a beautifully simple 
manner in SQL: we simply add the primary key id field for 
the target entity to the GROUP BY list; see Table |IV] 


VH. Evaluation 

Our experimental study describes how EactorBase can 
be used to implement a challenging machine learning applica¬ 
tion: Constructing a Bayesian network model for a relational 
database. Bayesian networks are a good illustration of typical 
challenges and how RDBMS capabilities can address them 
because: (1) Bayesian networks are widely regarded as a very 
useful model class in machine learning and AI, that supports 
decision making and reasoning under uncertainty. At the same 
time, they are considered challenging to learn from data. (2) 
Database researchers have proposed Bayesian networks for 
combining databases with uncertainty Eol. (3) A Bayesian 
network with par-RVs can be easily converted to other first- 
order representations, such as a Markov Logic Network; see 
0 . 

We describe the system and the datasets we used. Code was 
written in MySQL Script and Java, JRE 1.7.0. and executed 
with 8GB of RAM and a single Intel Core 2 QUAD Processor 
Q6700 with a clock speed of 2.66GHz (no hyper-threading). 
The operating system was Linux Centos 2.6.32. The MySQL 
Server version 5.5.34 was run with 8GB of RAM and a single 
core processor of 2.2GHz. All code and datasets are available 
on-line (241. 

A. Datasets 

We used six benchmark real-world databases. Eor detailed 
descriptions and the sources of the databases, please see 
(241 and the references therein. Table |V| summarizes basic 
information about the benchmark datasets. IMDb is the largest 
dataset in terms of number of total tuples (more than 1.3M 
tuples) and schema complexity. It combines the MovieLens 
databas^with data from the Internet Movie Database (IMDbJ^ 
following ( 22 I. 


TABLE V Datasets characteristics. #Tuples = total number 

OE TUPLES OVER ALL TABLES IN THE DATASET. 


Dataset 

#Relationship 
Tables/ Total 

# par-RV 

#Tuples 

Movielens 

1 / 3 

7 

1,010,051 

Mutagenesis 

2/4 

11 

14,540 

UW-CSE 

2/4 

14 

712 

Mondial 

2/4 

18 

870 

Hepatitis 

3 / 7 

19 

12,927 

IMDb 

3 / 7 

17 

1,354,134 


Table [V| provides information about the number of par-RVs 
generated for each database. More complex schemas 
generate more random variables. 

B. Bayesian Network Learning 

Eor learning the structure of a first-order Bayesian network, 
we used EactorBase to implement the previously existing 
leam-and-join algorithm (LAJ). The model search strategy 
of the LAJ algorithm is an iterative deepening search for 

3www.grouplens.org, IM version 
^www.imdb.com, July 2013 

























TABLE IV. SQL queries eor computing target contingency tables supporting test set prediction. <Attribute-List> and 

<Key-Equality-List> are as in Figure|^ 


Access 

SELECT 

WHERE 

GROUP BY 

Single 

COUNT(*) AS count, <Attribute-List>, S.sid 

<Key-Equality-List> AND S.s id = jack 

< Attribute-List > 

Block 

COUNT(*) AS count, <Attribute-List>, S.sid 

< Key-Equality-List> 

<Attribute-List>, S.sid 


correlations among attributes along longer and longer chains 
of relationships. For more details please see ||26]| . The previous 
implementation of the LAJ algorithm posted at 1241 . limits the 
par-factors so they contain at most two relationship par-RVs; 
FactorBase overcomes this limitation. 

A major design decision is how to make sufficient statistics 
available to the LAJ algorithm. In our experiments we followed 
a pre-counting approach where the count manager constructs a 
joint contingency table for all par-RVs in the random variable 
database. An alternative would be on-demand counting, which 
computes many contingency tables, but only for factors that 
are constructed during the model search El. Pre-counting is 
a form of data preprocessing: Once the joint contingency table 
is constructed, local contingency tables can be built quickly by 
summing (Group By). Different structure learning algorithms 
can therefore be run quickly on the same joint contingency 
table. For our evaluation, pre-counting has several advantages. 
(1) Constructing the joint contingency table presents a maxi¬ 
mally challenging task for the count manager. (2) Separating 
counting/data access from model search allows us to assess 
separately the resources required for each task. 


C. Results 

Table |Vl] reports the number of sufficient statistics for 
constructing the joint contingency table. This number depends 
mainly on the number of par-RVs. The number of sufficient 
statistics can be quite large, over 15M for the largest dataset 
IMDb. Even with such large numbers, constructing contin¬ 
gency tables using the SQL metaqueries is feasible, taking 
just over 2 hours for the very large IMDb set. The number of 
Bayesian network parameters is much smaller than the number 
of sufficient statistics. The difference between the number of 
parameters and the number of sufficient statistics measures 
how compactly the BN summarizes the statistical information 
in the data. Table |Vj shows that Bayesian networks provide 
very compact summaries of the data statistics. For instance for 
the Hepatitis dataset, the ratio is 12,374,892/569 > 20,000. 
The IMDb database is an outlier, with a complex correlation 
pattern that leads to a dense Bayesian network structure. 


TABLE VI. Count Manager: Sueeicient Statistics and 
Parameters 


Dataset 

# Database 
Tuples 

# Sufficient 
Statistics (SS) 

SS 

Computing 
Time (s) 

#bn 

Parameters 

Movielens 

1,010,051 

252 

2.7 

292 

Mutagenesis 

14,540 

1,631 

1.67 

721 

UW-CSE 

712 

2,828 

3.84 

241 

Mondial 

870 

1,746,870 

1,112.84 

339 

Hepatitis 

12,927 

12,374,892 

3,536.76 

569 

IMDb 

1,354,134 

15,538,430 

7,467.85 

60,059 


number of parameters. The parameter manager provides fast 
maximum likelihood estimates for a given structure. This is 
because computing a local contingency table for a BN family 
is fast given the joint contingency table. 

TABLE VII. Model Manager Evaluation. 


Dataset 

# Edges in 
Bayes Net 

# Bayes Net 
Parameters 

Parameter 

Learning 

Time (s) 

Movielens 

72 

292 

0.57 

Mutagenesis 

124 

721 

0.98 

UW-CSE 

112 

241 

1.14 

Mondial 

141 

339 

60.55 

Hepatitis 

207 

569 

429.15 

IMDb 

195 

60,059 

505.61 


Figure [^compares computing predictions on a test set using 
an instance-by-instance loop, with a separate SQL query for 
each instance, vs. a single SQL query for all test instances 
as a block (Table [rv|). Table |VIlT| specifies the number of test 
instances for each dataset. We split each benchmark database 
into 80% training data, 20% test data. The test instances are 
the ground atoms of all descriptive attributes of entities. The 
blocked access method is 10-100 faster depending on the 
dataset. The single access method did not scale to the large 
IMDb dataset (timeout after 12 hours). 


TABLE VIII. # OE Test Instances 


Dataset 

Movielens 

Mutagenesis 

UW-CSE 

Mondial 

Hepatitis 

IMDb 

#instance 

4,742 

3,119 

576 

505 

2,376 

46,275 



Movielens Mutagenesis Mondial UW-CSE Hepatitis IMDB 

■ Blocked Instances ■ Instance-by-instance 


Figure 9. Times (s) for Computing Predictions on Test Instances. The right 
red column shows the time for looping over single instances using the Single 
Access Query of Table HY] The left blue column shows the time for the 
Blocked Access Query of Table [Tv] 


Table |VII| shows that the graph structure of a Bayesian 
network contains a small number of edges relative to the 


Table |I^ reports result for the complete learning of a 
Bayesian network, structure and parameters. It benchmarks 
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FactorBase against functional gradient boosting, a state-of- 
the-art multi-relational learning approach. MLN_Boost learns 
a Markov Logic Network, and RDN_Boost a Relational De¬ 
pendency Network. We used the Boostr implementation ca. 
To make the results easier to compare across databases and 
systems, we divide the total running time by the number of par- 
RVs for the database (Table [V|). Table |!x| shows that structure 
learning with FactorBase is fast: even the large complex 
database IMDb requires only around 8 minutes/par-RV. Com¬ 
pared to the boosting methods, FactorBase shows excellent 
scalability: neither boosting method terminates on the IMDb 
database, and while RDN_Boost terminates on the MovieLens 
database, it is almost 5,000 times slower than FactorBase. 
Much of the speed of our implementation is due to quick 
access to sufficient statistics. As the last column of Table Hxl 
shows, on the larger datasets FactorBase spends about 
80% of computation time on gathering sufficient statistics via 
the count manager. This suggests that a large speedup for 
the boosting algorithms could be achieved if they used the 
FactorBase in-database design. 

We do not report accuracy results due to space constraints 
and because predictive accuracy is not the focus of this paper. 
On the standard conditional log-likelihood metric, as defined 
by Equation!^ the model learned by FactorBase performs 
better than the boosting methods on all databases. This is 
consistent with the results of previous studies ll^ . 

TABLE IX. Learning Time Comparison (sec) with other 

STATISTICAL-RELATIONAL LEARNING SYSTEMS. NT = NON-TERMINATION 


Dataset 

RDN Boost 

MLN Boost 

FB-Total 

FB-Count 

MovieLens 

5,562 

N/T 

1.12 

0.39 

Mutagenesis 

118 

49 

1 

0.15 

UW-CSE 

15 

19 

1 

0.27 

Mondial 

27 

42 

102 

61.82 

Hepatitis 

251 

230 

286 

186.15 

IMDb 

N/T 

N/T 

524.25 

439.29 


Conclusion. FactorBase leverages RDBMS capabilities 
for scalable management of statistical analysis objects. It 
efficiently constructs and stores large numbers of sufficient 
statistics and parameter estimates. The RDBMS support for 
statistical-relational learning translates into orders of magni¬ 
tude improvements in speed and scalability. 

VIII. Related Work 

The design space for combining machine learning with data 
management systems offers a number of possibilities, several 
of which have been explored in previous and ongoing research. 
We selectively review the work most relevant to our research. 
Figure [T^ provides a tree structure for the research landscape. 

A. Single-Table Machine Learning 

Most machine learning systems, such as Weka or R, support 
learning from a single table or data matrix only. The single¬ 
table representation is appropriate when the data points rep¬ 
resent a homogeneous class of entities with similar attributes, 
where the attributes of one entity are independent of those of 
others ca. The only way a single-table system can be applied 



Figure 10. A tree structure for related work in the design space of machine 
learning x data management 


to multi-relational data is after a preprocessing step where 
multiple interrelated tables are converted to a single data table. 
When the learning task is classification, such preprocessing is 
often called propositionalization ifT^ . This “fiattening” of the 
relational structure typically involves a loss of information. 

1) RDBMS Learning: Leveraging RDBMS capabilities 
through SQL programming is the unifying idea of the recent 
MADLib framework GD. An advantage of the MADLib 
approach that is shared by FactorBase is that in-database 
processing avoids exporting the data from the input database. 
The Apache Spark m framework includes MLBase and Spark- 
SQL that provide support for distributed processing, SQL, and 
automatic refinement of machine learning algorithms and mod¬ 
els lua. Other RDBMS applications include gathering suffi¬ 
cient statistics la, and convex optimization ||6| . The MauveDB 
system lO emphasizes the importance of several RDBMS 
features for combining statistical analysis with databases. As 
in FactorBase, this includes storing models and associated 
parameters as objects in their own right, and using the view 
mechanism to update statistical objects as the data change. A 
difference is that MauveDB presents model-based views of the 
data to the user, whereas FactorBase presents views of the 
models to machine learning applications. 

2) RDBMS Inference: Wong et al applied SQL operators 
such as the natural join to perform log-linear inference with 
a single-table graphical model |[3^ stored in an RDBMS. 
Monte Carlo methods have also been implemented with an 
RDBMS to perform inference with uncertain data flTj . 1311 . 
The MCDB system ifT^ stores parameters in database tables 
like FactorBase . 

B. Multi-Relational Learning 

For overviews of multi-relational learning please see ITtII. ll4ll. 
m. Most implemented systems, such as Aleph and Alchemy, 
use a logic-based representation of data derived from Prolog 
facts, that originated in the Inductive Logic Programming 
community m 

1) RDBMS Learning: The ClowdFlows system M allows 
a user to specify a MySQL database as a data source, then 
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converts the MySQL data to a single-table representation 
using propositionalization. Singh and Graepel 1^ present an 
algorithm that analyzes the relational database system catalog 
to generate a set of nodes and a Bayesian network structure. 
This approach utilizes SQL constructs as a data description 
language in a way that is similar to our Schema Analyzer. 
Differences include the following. (1) The Bayesian network 
structure is fixed and based on latent variables, rather than 
learned for observable variables only, as in our case study. 
(2) The RDBMS is not used to support learning after random 
variables have been extracted from the schema. 

Qian et al 1^ discuss work related to the contingency 
table problem and introduce contingency table algebra. Their 
paper focuses on a Virtual Join algorithm for computing 
sufficient statistics that involve negated relationships. They do 
not discuss integrating contingency tables with other structured 
objects for multi-relational learning. 

2 ) RDBMS Inference: Database researchers have developed 
powerful probabilistic inference algorithms for multi-relational 
models. The BayesStore system BOl introduced the principle 
of treating all statistical objects as first-class citizens in a 
relational database as FactorBase does. The Tuffy system 
1^ achieves highly reliable and scalable inference for Markov 
Logic Networks (MLNs) with an RDBMS. It leverages infer¬ 
ence capabilities to perform MLN parameter learning. RDBMS 
support for local search parameter estimation procedures, 
rather than closed-form maximum-likelihood estimation, has 
also been explored 0, Cni, ED. 

IX. Conclusion and Future Work 

Compared to traditional learning with a single data ta¬ 
ble, learning for multi-relational data requires new system 
capabilities. In this paper we described FactorBase , a 
system that leverages the existing capabilities of an SQL-based 
RDBMS to support statistical-relational learning. Representa¬ 
tional tasks include specifying metadata about structured first- 
order random variables, and storing the structure of a learned 
model. Computational tasks include storing and constructing 
sufficient statistics, and computing parameter estimates and 
model selection scores. We showed that SQL scripts can be 
used to implement these capabilities, with multiple advan¬ 
tages. These advantages include: 1) Fast program development 
through high-level SQL constructs for complex table and count 
operations. 2) Managing large and complex statistical objects 
that are too big to fit in main memory. For instance, some of 
our benchmark databases require storing and querying millions 
of sufficient statistics. While FactorBase provides good 
solutions for each of these system capabilities in isolation, the 
ease with which large complex statistical-relational objects can 
be integrated via SQL queries is a key feature. Because infor¬ 
mation about random variables, sufficient statistics, and models 
is all represented in relational database tables, a machine 
learning application can access and combine the information 
in a uniform way via SQL queries. 

Empirical evaluation on six benchmark databases showed 
significant scalability advantages from utilizing the RDBMS 
capabilities: Both structure and parameter learning scaled well 


to millions of data records, beyond what previous multi- 
relational learning systems can achieve. 

Future Work. Further potential application areas for FAC¬ 
TORBASE include managing massive numbers of aggregate 
features for classification (231, and collective matrix factoriza¬ 
tion ED, (28l. While our implementation has used simple SQL 
plus indexes, there are opportunities for optimizing RDBMS 
operations for the workloads required by statistical-relational 
structure learning. These include view materialization and the 
key scalability bottleneck of computing multi-relational suf¬ 
ficient statistics. NoSQL databases can exploit a fiexible data 
representation for scaling to very large datasets. However, SRL 
requires count operations for random complex join queries, 
which is a challenge for less structured data representations. 
An important goal is a single RDBMS package for both 
learning and inference that integrates FactorBase with 
inference systems such as BayesStore and Tuffy. There are 
several fundamental system design choices whose trade-offs 
for SRL warrant exploration. These include the choice between 
pre-counting and post-counting sufficient statistics, and using 
main memory vs. RDBMS disk storage. For instance, model 
selection scores can be cached in either main memory or the 
database. Our SQL-based approach facilitates using distributed 
computing systems such as SparkSQL (H, which have shown 
great potential for scalability. In sum, we believe that the suc- 
cesful use of SQL presented in this paper shows that relational 
algebra can play the same role for multi-relational learning as 
linear algebra for single-table learning: a unified language for 
both representing statistical objects and for computing with 
them. 
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Appendix 

We provide details about the Schema Analyzer. Table 
shows the relational schema of the Random Variable Database. 
Figure [m shows dependencies between the tables of this 
schema. 


TABLE X. Schema for Random Variable Database 


Table Name 

Schema 

AttributeColumns 

TABLE NAME, COLUMN NAME 

Domain 

COLUMN NAME, VALUE 

Pvariables 

Pvid, TABLE NAME 

1 Variables 

IVarlD, COLUMN NAME, Pvid 

2Variables 

2VarID, COLUMN_NAME, Pvidl, Pvid2, 
TABLE NAME 

Relationship 

RVarlD, TABLE_NAME, Pvidl, Pvid2, 
COLUMN_NAMEl, COLUMN_NAME2 


INFORMATION_SCHEMA 



Schema_PositionJnfo Schema_Key_lnfo 



AttributeColumns EntityTables ForeignKeyColumns 



Figure 11. Tables Dependency in the Random Variable Database VDB. 


/*AchemaAnalyzer.sql*/ 

DROP SCHEMA IF EXISTS @database@_AchemaAnalyzer; 

CREATE SCHEMA @database@_AchemaAnalyzer; 

CREATE SCHEMA if not exists @database@_BN; 

CREATE SCHEMA if not exists @database@_CT; 

USE @database@_AchemaAnalyzer; 

SET storage_engine=INNODB; 

CREATE TABLE Schema_Key_Info AS SELECT TABLE_NAME, COLUMN_NAME, 
REFERENCED_TABLE_NAME, REFERENCED_COLUMN_NAME, CONSTRAINT_NAME 
FROM INFORMATION_SCHEMA.KEY_COLUMN_USAGE 
WHERE (KEY_COLUMN_USAGE.TABLE_SCHEMA = 

'@database@') ORDER BY TABLE_NAME; 
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CREATE TABLE Schema_Position_Info AS 
SELECT COLUMNS.TABLE_NAME, 

COLUMNS.COLUMN_NAME, 

COLUMNS.ORDINAL_POSITION EROM 
INEORMATION_SCHEMA.COLUMNS, 

INEORMATION_SCHEMA.TABLES 
WHERE 

(COLUMNS.TABLE_SCHEMA = '@database@' 

AND TABLES.TABLE_SCHEMA = '@database@' 

AND TABLES.TABLE_NAME = COLUMNS.TABLE_NAME 
AND TABLES.TABLE_TYPE = 'BASE TABLE') 

ORDER BY TABLE_NAME; 

CREATE TABLE NoPKeys AS SELECT TABLE_NAME EROM 

Schema_Key_Info 

WHERE 

TABLE_NAME NOT IN (SELECT 
TABLE_NAME 

EROM 

Schema_Key_Info 
WHERE 

CONSTRAINT_NAME LIKE 'PRIMARY'); 

CREATE table NumEntityColumns AS 
SELECT 

TABLE_NAME, COUNT (DISTINCT COLUMN_NAME) num 

EROM 

Schema_Key_Info 
WHERE 

CONSTRAINT_NAME LIKE 'PRIMARY' 

OR REEERENCED_COLUMN_NAME IS NOT NULL 
GROUP BY TABLE_NAME; 

CREATE TABLE TernaryRelations as SELECT TABLE_NAME EROM 
NumEntityColumns 
WHERE 
num > 2; 


CREATE TABLE AttributeColumns AS 
SELECT TABLE_NAME, COLUMN_NAME EROM 
Schema_Position_Info 
WHERE 

(TABLE_NAME , COLUMN_NAME) NOT IN (SELECT 
TABLE_NAME, COLUMN_NAME 

EROM 

KeyColumns) 

and TABLE_NAME NOT IN (SELECT 
TABLE_NAME 

EROM 

NoPKeys) 

and TABLE_NAME NOT IN (SELECT 
TABLE_NAME 

EROM 

TernaryRelations); 

ALTER TABLE AttributeColumns 

ADD PRIMARY KEY (TABLE_NAME,COLUMN_NAME); 

CREATE TABLE InputColumns AS SELECT * EROM 

KeyColumns 

WHERE 

CONSTRAINT_NAME = 'PRIMARY' 

ORDER BY TABLE_NAME; 

CREATE TABLE EoreignKeyColumns AS SELECT * EROM 

KeyColumns 

WHERE 

REEERENCED_COLUMN_NAME IS NOT NULL 
ORDER BY TABLE_NAME; 

ALTER TABLE EoreignKeyColumns 

ADD PRIMARY KEY (TABLE_NAME,COLUMN_NAME, 

REEERENCED_TABLE_NAME); 

CREATE TABLE EntityTables AS 

SELECT distinct TABLE_NAME, COLUMN_NAME 

EROM 

KeyColumns T 


WHERE 

1 = (SELECT 

COUNT(COLUMN_NAME) 

EROM 

KeyColumns T2 
WHERE 

T.TABLE_NAME = T2.TABLE_NAME 

AND CONSTRAINT_NAME = 'PRIMARY'); 

ALTER TABLE EntityTables 

ADD PRIMARY KEY (TABLE_NAME,COLUMN_NAME); 

CREATE TABLE SelfRelationships AS 
SELECT DISTINCT RTablesI.TABLE_NAME 
AS TABLE_NAME, 

RTablesI.REEERENCED_TABLE_NAME AS REEERENCED_TABLE_NAME, 

RTablesI.REEERENCED_COLUMN_NAME AS REEERENCED_COLUMN_NAME EROM 
KeyColumns AS RTablesI, 

KeyColumns AS RTables2 
WHERE 

(RTablesI.TABLE_NAME = RTables2.TABLE_NAME) AND 

(RTablesI.REEERENCED_TABLE_NAME = RTables2.REEERENCED_TABLE_NAME) 

AND 

(RTablesI.REEERENCED_COLUMN_NAME = RTables2.REEERENCED_COLUMN_NAME) 
AND 

(RTablesI.ORDINAL_POSITION < RTables2.ORDINAL_POSITION); 

ALTER TABLE SelfRelationships ADD PRIMARY KEY (TABLE_NAME); 

CREATE TABLE Many_OneRelationships AS 
SELECT KeyColumnsl.TABLE_NAME EROM 
KeyColumns AS KeyColumnsl, 

KeyColumns AS KeyColumns2 
WHERE 

(KeyColumnsl.TABLE_NAME , KeyColumnsl.COLUMN_NAME) IN (SELECT 
TABLE_NAME, COLUMN_NAME 
EROM InputColumns) 

AND (KeyColumns2.TABLE_NAME , KeyColumns2.COLUMN_NAME) IN 
(SELECT TABLE_NAME, COLUMN_NAME 
EROM EoreignKeyColumns) 

AND (KeyColumns2.TABLE_NAME , KeyColumns2.COLUMN_NAME) 

NOT IN (SELECT TABLE_NAME, COLUMN_NAME 
EROM InputColumns); 

CREATE TABLE PVariables AS 

SELECT CONCAT(EntityTables.TABLE_NAME, '0') AS Pvid, 

EntityTables.TABLE_NAME, 

0 AS index_number EROM 

EntityTables 

UNION 

SELECT 

CONCAT(EntityTables.TABLE_NAME, '1') AS Pvid, 

EntityTables.TABLE_NAME, 

1 AS index_number 
EROM 

EntityTables, 

SelfRelationships 
WHERE 

EntityTables.TABLE_NAME = SelfRelationships.REEERENCED_TABLE_NAME 
AND 

EntityTables.COLUMN_NAME = SelfRelationships.REEERENCED_COLUMN_NAME; 
ALTER TABLE PVariables ADD PRIMARY KEY (Pvid); 

CREATE TABLE RelatlonTables AS 

SELECT DISTINCT EoreignKeyColumns.TABLE_NAME, 

EoreignKeyColumns.TABLE_NAME IN (SELECT 
TABLE_NAME 

EROM 

SelfRelationships) AS SelfRelationship, 

EoreignKeyColumns.TABLE_NAME IN (SELECT 
TABLE_NAME 

EROM 

Many_OneRelationships) AS Many_OneRelationship EROM 
EoreignKeyColumns; 

ALTER TABLE RelatlonTables 
ADD PRIMARY KEY (TABLE_NAME); 
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CREATE TABLE iVariables AS 

SELECT CONCAT(''', COLUMN_NAME, ' , Pvid, ''') AS IVarlD, 

COLUMN_NAME, 

Pvid, 

index_number = 0 AS main EROM 
PVariabies 

NATURAL JOIN 
AttributeColumns; 

ALTER TABLE IVariables ADD PRIMARY KEY (IVarlD); 

ALTER TABLE IVariables ADD UNIQUE(Pvid,COLUMN_NAME); 

CREATE TABLE EoreignKeys_pvars AS 
SELECT EoreignKeyColumns.TABLE_NAME, 

EoreignKeyColumns.REEERENCED_TABLE_NAME, 

EoreignKeyColumns.COLUMN_NAME, 

Pvid, 

index_number, 

ORDINAL_POSITION AS ARGUMENT_POSITION EROM 
EoreignKeyColumns, 

PVariabies 

WHERE 

PVariabies.TABLE_NAME = REEERENCED_TABLE_NAME; 

ALTER TABLE EoreignKeys_pvars 
ADD PRIMARY KEY (TABLE_NAME,Pvid, 

ARGUMENT_POSITION) ; 

CREATE table Relationship_MM_NotSelf AS 
SELECT 

CONCAT(' '', 

EoreignKeys_pvarsi.TABLE_NAME, 

EoreignKeys_pvarsl.Pvid, 

EoreignKeys_pvars2.Pvid, 

')' r 

' '') AS orig_RVarID, 

EoreignKeys_pvarsi.TABLE_NAME, 

EoreignKeys_pvars1.Pvid AS Pvidl, 

EoreignKeys_pvars2.Pvid AS Pvid2, 

EoreignKeys_pvarsl.COLUMN_NAME AS C0LUMN_NAME1, 

EoreignKeys_pvars2.COLUMN_NAME AS C0LUMN_NAME2, 
(EoreignKeys_pvarsl.index_number = 0 

AND EoreignKeys_pvars2.index_number = 0) AS main 

EROM 

EoreignKeys_pvars AS EoreignKeys_pvarsl, 

EoreignKeys_pvars AS EoreignKeys_pvars2, 

RelationTables 

WHERE 

EoreignKeys_pvarsl.TABLE_NAME = EoreignKeys_pvars2.TABLE_NAME 
AND RelationTables.TABLE_NAME = EoreignKeys_pvarsl.TABLE_NAME 
AND EoreignKeys_pvarsl.ARGUMENT_POSITION < 
EoreignKeys_pvars2.ARGUMENT_POSITION 
AND RelationTables.SelfRelationship = 0 
AND RelationTables.Many_OneRelationship = 0; 

CREATE table Relationship_MM_Self AS 
SELECT 

CONCAT(' '', 

EoreignKeys_pvarsi.TABLE_NAME, 

EoreignKeys_pvarsl.Pvid, 

EoreignKeys_pvars2.Pvid, 

')' r 

' '') AS orig_RVarID, 

EoreignKeys_pvars1.TABLE_NAME, 

EoreignKeys_pvars1.Pvid AS Pvidl, 

EoreignKeys_pvars2.Pvid AS Pvid2, 

EoreignKeys_pvarsl.COLUMN_NAME AS C0LUMN_NAME1, 

EoreignKeys_pvars2.COLUMN_NAME AS COLUMN_NAME2, 
(EoreignKeys_pvars1.index_number = 0 

AND EoreignKeys_pvars2.index_number = 1) AS main 

EROM 

EoreignKeys_pvars AS EoreignKeys_pvarsl, 

EoreignKeys_pvars AS EoreignKeys_pvars2, 

RelationTables 

WHERE 


EoreignKeys_pvarsl.TABLE_NAME = EoreignKeys_pvars2.TABLE_NAME 
AND RelationTables.TABLE_NAME = EoreignKeys_pvarsl.TABLE_NAME 
AND EoreignKeys_pvarsl.ARGUMENT_POSITION < 

EoreignKeys_pvars2.ARGUMENT_POSITION 
AND 

EoreignKeys_pvarsl.index_number < EoreignKeys_pvars2.index_number 

AND RelationTables.SelfRelationship = 1 

AND RelationTables.Many_OneRelationship = 0; 

CREATE table Relationship_MO_NotSelf AS 
SELECT 

CONCAT(''', 

EoreignKeys_pvars.REEERENCED_TABLE_NAME, 

PVariabies.Pvid, 

EoreignKeys_pvars.Pvid, 

''') AS orig_RVarID, 

EoreignKeys_pvars.TABLE_NAME, 

PVariabies.Pvid AS Pvidl, 

EoreignKeys_pvars.Pvid AS Pvid2, 

KeyColumns.COLUMN_NAME AS COLUMN_NAMEl, 

EoreignKeys_pvars.COLUMN_NAME AS COLUMN_NAME2, 

(PVariabies.index_number = 0 

AND EoreignKeys_pvars.index_number = 0) AS main 

EROM 

EoreignKeys_pvars, 

RelationTables, 

KeyColumns, 

PVariabies 

WHERE 

RelationTables.TABLE_NAME = EoreignKeys_pvars.TABLE_NAME 
AND RelationTables.TABLE_NAME = PVariabies.TABLE_NAME 
AND RelationTables.TABLE_NAME = KeyColumns.TABLE_NAME 
AND RelationTables.SelfRelationship = 0 
AND RelationTables.Many_OneRelationship = 1; 

CREATE table Relationship_MO_Self AS 
SELECT 

CONCAT(''', 

EoreignKeys_pvars.REEERENCED_TABLE_NAME, 

PVariabies.Pvid, 

')=' r 

EoreignKeys_pvars.Pvid, 

''') AS orig_RVarID, 

EoreignKeys_pvars.TABLE_NAME, 

PVariabies.Pvid AS Pvidl, 

EoreignKeys_pvars.Pvid AS Pvid2, 

KeyColumns.COLUMN_NAME AS COLUMN_NAMEl, 

EoreignKeys_pvars.COLUMN_NAME AS COLUMN_NAME2, 

(PVariabies.index_number = 0 

AND EoreignKeys_pvars.index_number =1) AS main 

EROM 

EoreignKeys_pvars, 

RelationTables, 

KeyColumns, 

PVariabies 

WHERE 

RelationTables.TABLE_NAME = EoreignKeys_pvars.TABLE_NAME 
AND RelationTables.TABLE_NAME = PVariabies.TABLE_NAME 
AND RelationTables.TABLE_NAME = KeyColumns.TABLE_NAME 
AND PVariabies.index_number < EoreignKeys_pvars.index_number 
AND RelationTables.SelfRelationship = 1 
AND RelationTables.Many_OneRelationship = 1; 

CREATE TABLE Relationship AS SELECT * EROM 
Relationship_MM_NotSelf 
UNION SELECT 
* 

EROM 

Relationship_MM_Self 
UNION SELECT 

EROM 

Relationship_MO_NotSelf 
UNION SELECT 

EROM 



Relationship_MO_Self; 

ALTER TABLE Relationship ADD PRIMARY KEY (orig_RVarID) 
ALTER TABLE 'Relationship' 

ADD COLUMN 'RVarlD' VARCHAR(IO) NULL , 

ADD UNIQUE INDEX 'RVarID_UNIQUE' ('RVarlD' ASC) ; 


CREATE TABLE 2Variables AS SELECT CONCAT(''', 
COLUMN_NAME, 

PvidI, 

Pvid2, 

')' r 

' '') AS 2VarID, 

COLUMN_NAME, 

PvidI, 

Pvid2, 

TABLE_NAME, 
main EROM 

Relationship NATURAL JOIN AttributeColumns; 

ALTER TABLE 2Variables ADD PRIMARY KEY (2VarID);</p> 



